Transforming Large Collections of Scientific Publications to XML
Authors
Abstract
We describe an experiment transforming large collections of LaTeX documents to more machine-understandable representations. Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arχiv) using LaTeXML, a LaTeX to XML converter currently under development. While the long-term goal is a large body of scientific documents available for semantic analysis, search indexing and other experimentation, the immediate goals are tools for creating such corpora. The first task of our arXMLiv project is to develop LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arχiv collection, as well as methods for coping with the eccentricities that TeX encourages. We have created a distributed build system that runs LaTeXML over the collection, in part or entirely, while collecting statistics about missing bindings and macros, and other errors. This guides debugging and development efforts, leading to iterative improvements in both the tools and the quality of the converted corpus. The build system thus serves as both a production conversion engine and a software test harness. We have now processed the complete arχiv collection through 2006, consisting of more than 400,000 documents (a complete run is a processor-year-size undertaking), continuously improving our success rate. We are now able to convert more than 90% of these documents to XHTML+MathML. We consider over 60% to be successes, converted with no or minor warnings. While the remaining 30% can also be converted, their quality is doubtful, due to unsupported macros or conversion errors.
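The statistics-gathering step described in the abstract can be illustrated with a small sketch. The abstract does not specify the build system's data model or LaTeXML's log format, so everything here is an assumption made for illustration: the `ConversionResult` record, its fields, the three status labels, and the sample data are all hypothetical. The sketch only shows the general idea of tallying outcomes and ranking missing bindings by how many documents they affect, so that the most widely needed bindings can be written first.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ConversionResult:
    """Hypothetical per-document record; not the project's actual schema."""
    doc_id: str
    converted: bool                  # did the run produce any output at all?
    errors: int = 0                  # number of conversion errors reported
    missing: list = field(default_factory=list)  # missing bindings or macros


def classify(result):
    """Map one record to an assumed status label."""
    if not result.converted:
        return "failed"
    if result.errors == 0 and not result.missing:
        return "success"
    return "doubtful"                # converted, but quality is uncertain


def summarize(results):
    """Tally statuses and count how many documents need each missing item."""
    status_counts = Counter(classify(r) for r in results)
    missing_counts = Counter()
    for r in results:
        # Count each missing item once per document, not once per use.
        missing_counts.update(set(r.missing))
    return status_counts, missing_counts


# Made-up sample data, purely for illustration.
results = [
    ConversionResult("doc-1", True),
    ConversionResult("doc-2", True, errors=2, missing=["foo.sty", "\\mymacro"]),
    ConversionResult("doc-3", True, errors=1, missing=["foo.sty"]),
    ConversionResult("doc-4", False, errors=5, missing=["bar.cls"]),
]

status_counts, missing_counts = summarize(results)
print(dict(status_counts))
print(missing_counts.most_common(1))
```

In this toy run, `foo.sty` is missing from the most documents, so under the assumed workflow it would be the first binding to implement before the next pass over the collection.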
Similar resources
Transforming the arχiv to XML
We describe an experiment of transforming large collections of LaTeX documents to more machine-understandable representations. Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arXiv) using the LaTeX to XML converter which is currently under development. The main technical task of our arXMLiv project is to supply LaTeXML bindings for the (tho...
A standard TMF modeling for Arabic patents
Patent applications are similarly structured worldwide. They consist of a cover page, a specification, claims, drawings (if necessary) and an abstract. In addition to their content (text, numbers and citations), all patent publications contain a relatively rich set of well-defined metadata. In the Arabic world, there is no North African or Arabian Intellectual Property Office and therefore no u...
Apply Uncertainty in Document-Oriented Database (MongoDB) Using F-XML
As we move to a big data world where data grows in unstructured ways and at high velocity, there is a need for data stores that can hold this volume of data. Relational databases, the traditional choice, are not suited to handling data at this scale, so a move to non-relational data stores is needed. In the current study, we have proposed an extension of the Mongo...
Croatian National Centre for Biobanking – a new perspective in biobanks governance?
Ethical issues in biobanking, as well as the organization and management of biobanks, have become a permanent topic of scientific publications in the last 15 years (1). The Expert Group of the European Commission defines biobanks as collections of various types of biological samples (cells, tissues, blood, DNA) plus related databases. They can be small collections or large national repositories, popu...
Journal title:
- Mathematics in Computer Science
Volume 3, Issue
Pages -
Publication date 2010